Tag

#multimodal ai

8 articles

Qwen3.7-Plus is Alibaba's bid to turn multimodal AI into a full-blown autonomous agent

Learn to build a simplified autonomous agent that mimics Alibaba's Qwen3.7-Plus capabilities, combining visual perception, GUI interaction, and code generation.

Jun 5 · 53

Google DeepMind Releases Gemma 4 12B: An Encoder-Free Multimodal Model with Native Audio That Runs on a 16 GB Laptop

Learn about Google DeepMind's Gemma 4 12B, an innovative encoder-free multimodal model that processes vision and audio natively on consumer hardware with only 16 GB VRAM.

Jun 4 · 34

NVIDIA and the University of Maryland Researchers Released Audio Flamingo Next (AF-Next): A Super Powerful and Open Large Audio-Language Model

Learn about Audio Flamingo Next (AF-Next), a new AI system that understands and describes sounds much as vision-language models describe images, opening up new possibilities for accessibility and smart technology.

Apr 13 · 70

Meta’s Muse Spark is here – and it’s closed source

Learn how to work with multimodal AI models like Meta's Muse Spark using open-source tools and libraries, even though the actual model is closed source.

Apr 8 · 104

Qwen3.5-Omni learned to write code from spoken instructions and video without anyone training it to

Learn how to build a system that processes audio and video inputs to generate code, simulating the capabilities of multimodal AI models like Qwen3.5-Omni.

Mar 31 · 101

Xiaomi launches three MiMo AI models to power agents, robots, and voice

Learn about Xiaomi's new MiMo AI models that combine multiple data types to create autonomous AI agents capable of controlling software, robots, and voice systems.

Mar 22 · 138

Amazon brings Alexa+ to the UK

This explainer explores Amazon's Alexa+ service and the AI concepts behind it, including multimodal processing, contextual awareness, and the large language models that are reshaping conversational AI systems.

Mar 19 · 111

7 surprisingly useful ways to use ChatGPT's voice mode, from a former skeptic

This explainer explores ChatGPT's Voice Mode technology, examining its multimodal architecture, real-time processing challenges, and implications for AI accessibility and reliability.

Mar 10 · 130